7 Steps to Mastering Apache Spark 2.0

Looking for a comprehensive guide on going from zero to Apache Spark hero in steps? Look no further! Written by our friends at Databricks, this exclusive guide provides a solid foundation for those looking to master Apache Spark 2.0.

By Jules S. Damji & Sameer Farooqui, Databricks.

Not a week goes by without a mention of Apache Spark in a blog, news article, or webinar on Spark’s impact in the big data landscape. Not a meetup or conference on big data or advanced analytics is without a speaker who expounds on aspects of Spark: touting its rapid adoption, speaking of its developments, and explaining its use cases in enterprises across industries.

All rightly so, and for good reason, as the Spark Survey 2015 showed that Spark’s growth is uncontested and unstoppable.

But what’s the allure? And how do you get started with a new computing platform? These are burning questions for any beginner. Consider these seven steps a gentle introduction to understanding Spark’s attraction and to mastering Spark, from concepts to coding.

Step 1: Why Apache Spark?

 
For one, Apache Spark is the most active open source data processing engine built for speed, ease of use, and advanced analytics, with over 1,000 contributors from more than 250 organizations and a growing community of developers and users. Second, as a general-purpose compute engine designed for distributed data processing at scale, Spark supports multiple workloads through a unified engine composed of Spark components as libraries, accessible via APIs in popular programming languages, including Scala, Java, Python, and R. And finally, it can be deployed in different environments, read data from various data sources, and interact with myriad applications.


Altogether, this unified compute engine makes Spark an ideal environment for diverse workloads: ETL, interactive queries (Spark SQL), advanced analytics (machine learning), graph processing (GraphX/GraphFrames), and streaming (Structured Streaming), all running within the same engine.


In the subsequent steps, you will get an introduction to some of these components, but first let’s capture key concepts and key terms.

Step 2: Apache Spark Concepts, Key Terms and Keywords

 
In June this year, KDnuggets published Apache Spark Key Terms, Explained, which is a fitting introduction here. Add to that vocabulary the following Spark architectural terms, as they are referenced throughout this article.

Spark Cluster
A collection of machines or nodes, in the cloud or on-premises in a data center, on which Spark is installed. Among those machines are Spark workers, a Spark master (also the cluster manager in Standalone mode), and at least one Spark driver.

Spark Master
As the name suggests, the Spark master JVM acts as the cluster manager in a Standalone deployment mode, to which Spark workers register themselves as part of a quorum. Depending on the deployment mode, it acts as a resource manager and decides where to launch Executors, how many to launch, and on which Spark workers in the cluster.

Spark Worker
The Spark worker JVM, upon receiving instructions from the Spark master, launches Executors on the worker on behalf of the Spark driver. Spark applications, decomposed into units of tasks, are executed on each worker’s Executor. In short, the worker’s job is only to launch an Executor on behalf of the master.

Spark Executor
It’s a JVM container, with an allocated amount of cores and memory, on which Spark runs its tasks. Each worker node launches its own Spark Executor with a configurable number of cores (or threads). Besides executing Spark tasks, an Executor also stores and caches data partitions in memory.

Spark Driver
Once it gets information from the Spark master about all the workers in the cluster and where they are, the driver program distributes Spark tasks to each worker’s Executor. The driver also receives computed results from each Executor’s tasks.


Fig 1. Spark Cluster

SparkSession and SparkContext
As shown in the diagram, the SparkContext is a conduit to access all Spark functionality; only a single SparkContext exists per JVM. The Spark driver program uses it to connect to the cluster manager, communicate with it, and submit Spark jobs. It lets you set Spark configuration parameters, and through the SparkContext the driver can instantiate other contexts, such as SQLContext, HiveContext, and StreamingContext, to program Spark.

With Apache Spark 2.0, however, SparkSession provides access to all of the aforementioned Spark functionality through a single, unified point of entry. As well as making it simpler to access Spark functionality, it also subsumes the underlying contexts used to manipulate data.

A recent blog post, How to Use SparkSessions in Apache Spark 2.0, explains this in detail.
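As a minimal sketch of what this looks like in practice (the application name, master URL, and configuration values below are illustrative, not from the article), you can create a SparkSession and still reach the older entry points through it:

import org.apache.spark.sql.SparkSession

// build (or reuse) the single, unified entry point introduced in Spark 2.0
val spark = SparkSession.builder()
  .appName("MySparkApp")                  // hypothetical application name
  .master("local[*]")                     // local mode; use your cluster manager's URL otherwise
  .config("spark.executor.memory", "2g")  // example executor setting
  .getOrCreate()

// the underlying SparkContext remains accessible through the session
val sc = spark.sparkContext
// and Spark SQL is available directly
spark.sql("SELECT 1 AS test").show()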


Fig 2. SparkContext and its interaction with Spark components

Spark Deployment Modes
Spark supports four cluster deployment modes, each with its own characteristics with respect to where Spark’s components run within a Spark cluster. Of all modes, the local mode, running on a single host, is by far the simplest.

As a beginner or intermediate developer, you don’t need to know this elaborate matrix right away. It’s here for your reference, and the links provide additional information. Furthermore, Step 3 takes a deep dive into all aspects of Spark’s architecture.

Local
  Driver: Runs on a single JVM
  Worker: Runs on the same JVM as the driver
  Executor: Runs on the same JVM as the driver
  Master: Runs on the single host

Standalone
  Driver: Can run on any node in the cluster
  Worker: Runs on its own JVM on each node
  Executor: Each worker in the cluster launches its own JVM
  Master: Can be allocated arbitrarily, on whichever node the master is started

YARN (client)
  Driver: Runs on a client, not part of the cluster
  Worker: YARN NodeManager
  Executor: Runs in a container on YARN’s NodeManager
  Master: YARN’s ResourceManager works with YARN’s ApplicationMaster to allocate containers on NodeManagers for the Executors

YARN (cluster)
  Driver: Runs within YARN’s ApplicationMaster
  Worker: Same as client mode
  Executor: Same as client mode
  Master: Same as client mode

Mesos (client)
  Driver: Runs on a client machine, not part of the Mesos cluster
  Worker: Runs on a Mesos slave
  Executor: Runs in a container within a Mesos slave
  Master: The Mesos master

Mesos (cluster)
  Driver: Runs within one of the Mesos masters
  Worker: Same as client mode
  Executor: Same as client mode
  Master: The Mesos master

Table 1. Deployment modes and where each component runs

Spark Apps, Jobs, Stages and Tasks
The anatomy of a Spark application usually comprises Spark operations, which can be either transformations or actions on your data sets, using Spark’s RDD, DataFrame, or Dataset APIs. For example, in your Spark app, if you invoke an action such as collect() or take() on your DataFrame or Dataset, the action creates a job. A job is then decomposed into one or more stages; stages are further divided into individual tasks; and tasks are units of execution that the Spark driver’s scheduler ships to Spark Executors on the Spark worker nodes to execute in your cluster. Often multiple tasks run in parallel on the same Executor, each processing its unit of the partitioned dataset in its memory.
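To make the terminology concrete, here is a minimal, hypothetical sketch (the file path and column name are illustrative): the filter below is a lazy transformation that launches no job, while count() is an action that triggers a job, which Spark decomposes into stages and tasks shipped to the Executors.

// hypothetical DataFrame of flight records with a "delay" column
val flights = spark.read.json("/data/flights.json")
// transformation: lazily defines a new DataFrame; no job is launched yet
val delayed = flights.filter("delay > 30")
// action: triggers a job, decomposed into stages and tasks on the Executors
val numDelayed = delayed.count()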

In this informative part of the video, Sameer Farooqui elaborates on each of these distinct stages in vivid detail.


Fig 3. Anatomy of a Spark Application

Step 3: Advanced Apache Spark Core

 
To understand how all Spark components interact, it’s essential to grasp Spark’s core architecture in detail. All the key terms and concepts defined above (and more) come to life when you hear and see them explained. In this Spark Summit training video, you can immerse yourself and take a journey into Spark’s core.

Step 4: DataFrames, Datasets and Spark SQL Essentials

 
In Step 3, if you watched the linked video, you learned about Resilient Distributed Datasets (RDDs), because they form the core data abstraction in Spark and underpin all other higher-level data abstractions and APIs, including DataFrames and Datasets.

In Spark 2.0, DataFrames and Datasets, built upon RDDs, form the core high-level, structured, distributed data abstraction across most libraries and components in Spark. A DataFrame is a distributed collection of data organized into named columns; it imposes a schema on how your data is organized and shapes how you process data, express a computation, or issue a query. Datasets go one step further by providing strict compile-time type safety, so certain types of errors are caught at compile time rather than at runtime.


Fig 4. Spectrum of error types detected for DataFrames & Datasets

Because of the structure and types in your data, Spark can understand how you express your computation, which typed columns or typed named fields you access, and which domain-specific operations you use. As a result, Spark can optimize your code through Spark 2.0’s Catalyst optimizer and generate efficient bytecode through Project Tungsten.

DataFrames and Datasets offer high-level, domain-specific language APIs, making your code expressive through high-level operators like filter, sum, count, avg, min, and max. Whether you express your computations in Spark SQL or in Python, Java, Scala, or R, the underlying code generated is identical because all execution planning goes through the same Catalyst optimizer.

For example, the high-level domain-specific code in Scala below and its equivalent relational query in SQL generate identical underlying code. Consider a Scala Dataset of Person objects and a SQL table “person.”

// a Dataset of Person objects with field names fname, lname, age, weight
// access using object notation
val seniorDS = peopleDS.filter(p => p.age > 55)
// a DataFrame with named columns fname, lname, age, weight
// access using column name notation
val seniorDF = peopleDF.where(peopleDF("age") > 55)
// equivalent Spark SQL code
val seniorSQLDF = spark.sql("SELECT * FROM person WHERE age > 55")
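The snippet above assumes peopleDS and peopleDF already exist. As a hedged sketch of how they might be created (the case class and file path are illustrative), you could read a JSON file into a DataFrame and convert it into a typed Dataset:

// hypothetical case class matching the fields fname, lname, age, weight
case class Person(fname: String, lname: String, age: Long, weight: Double)

import spark.implicits._
// a DataFrame of untyped Rows with named columns, inferred from JSON
val peopleDF = spark.read.json("/data/people.json")
// a typed Dataset, so field access like p.age is checked at compile time
val peopleDS = peopleDF.as[Person]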


To get a whirlwind introduction to why structuring data in Spark is important, and why DataFrames, Datasets, and Spark SQL provide an efficient way to program Spark, we urge you to watch this Spark Summit talk by Michael Armbrust, Spark PMC member and committer, in which he articulates the motivations and merits behind structure in Spark.

In addition, these two blogs discuss DataFrames and Datasets, and how to use them to process structured data like JSON files and to issue Spark SQL queries.

  1. Introduction to Datasets in Apache Spark
  2. A Tale of Three APIs: RDDs, DataFrames, and Datasets

Step 5: Graph Processing with GraphFrames

 
Even though Spark has a general-purpose, RDD-based graph processing library, GraphX, which is optimized for distributed computing and supports graph algorithms, it has some challenges: it has no Java or Python APIs, and it’s based on the low-level RDD APIs. Because of these constraints, it cannot take advantage of the recent performance and optimizations introduced in DataFrames through Project Tungsten and the Catalyst optimizer.

By contrast, the DataFrame-based GraphFrames address all these challenges: they provide a library analogous to GraphX but with high-level, expressive, and declarative APIs in Java, Scala, and Python; the ability to issue powerful SQL-like queries using the DataFrame APIs; support for saving and loading graphs; and they take advantage of the underlying performance and query optimizations in Spark 2.0. Moreover, GraphFrames integrate well with GraphX: you can seamlessly convert a GraphFrame into an equivalent GraphX representation.

In the graph diagram below, representing airport codes and their cities, all the vertices can be represented as rows of a DataFrame; likewise, all the edges can be represented as rows of a DataFrame, with their respective named and typed columns. Collectively, these vertex and edge DataFrames comprise a GraphFrame.


Fig 5. A graph of cities represented as GraphFrame

import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// create a vertices DataFrame
val vertices = spark.createDataFrame(List(("JFK", "New York", "NY"))).toDF("id", "city", "state")
// create an edges DataFrame
val edges = spark.createDataFrame(List(("JFK", "SEA", 45, 1058923))).toDF("src", "dst", "delay", "tripID")
// create a GraphFrame and use its APIs
val airportGF = GraphFrame(vertices, edges)
// filter all edges in the GraphFrame with delays greater than 30 mins
val delayDF = airportGF.edges.filter("delay > 30")
// using the PageRank algorithm, determine the airports' ranking of importance
val pageRanksGF = airportGF.pageRank.resetProbability(0.15).maxIter(5).run()
// display() is a Databricks notebook function; use show() outside notebooks
display(pageRanksGF.vertices.orderBy(desc("pagerank")))


With GraphFrames you can express three kinds of powerful queries: first, simple SQL-type queries on vertices and edges, such as which trips are likely to have major delays; second, graph-type queries, such as how many vertices have incoming and outgoing edges; and finally, motif queries, in which you provide a structural pattern or path of vertices and edges and then find those patterns in your graph’s dataset.
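For instance, a motif query is expressed as a structural pattern over vertices and edges. As a brief sketch against the airportGF GraphFrame defined above (assuming it holds more vertices and edges than the single rows shown), the pattern below finds pairs of airports connected by flights in both directions:

// motif query: flights from a to b and back from b to a
val roundTrips = airportGF.find("(a)-[ab]->(b); (b)-[ba]->(a)")
roundTrips.show()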

Additionally, GraphFrames easily support all the graph algorithms supported in GraphX. For example, find important vertices using PageRank, determine the shortest path from a source to a destination, perform a breadth-first search (BFS), or find strongly connected vertices for exploring social connections.
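As a hedged sketch of a few of these algorithms against the same airportGF GraphFrame (parameter values are illustrative):

// shortest paths from every vertex to the JFK landmark
val paths = airportGF.shortestPaths.landmarks(Seq("JFK")).run()
// breadth-first search from JFK to SEA
val bfsResults = airportGF.bfs.fromExpr("id = 'JFK'").toExpr("id = 'SEA'").run()
// strongly connected components, e.g. for exploring social connections
val scc = airportGF.stronglyConnectedComponents.maxIter(10).run()
// and, when needed, convert to the equivalent GraphX representation
val airportGraphX = airportGF.toGraphX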

In the webinar linked below, Joseph Bradley, Spark committer, gives an illuminating introduction to graph processing with GraphFrames, its motivations and ease of use, and the benefits of its DataFrame-based API. And through the notebook demonstrated as part of the webinar, you’ll learn how easily you can use GraphFrames to issue all of the aforementioned types of queries and algorithms.


GraphFrames: DataFrame-based API for Apache Spark

Complementing the above webinar, two instructive blogs, with their accompanying notebooks, offer an introductory and hands-on experience with DataFrame-based GraphFrames.

  1. Introduction to GraphFrames
  2. On-time Flight Performance with GraphFrames for Apache Spark

Going forward with Apache Spark 2.0, many Spark components, including MLlib (machine learning) and streaming, are increasingly moving toward offering equivalent DataFrame-based APIs, because of the performance gains, ease of use, and high-level abstraction and structure they provide. Where necessary or appropriate for your use case, you may elect to use GraphFrames instead of GraphX. Below is a succinct summary and comparison of GraphX and GraphFrames.


Fig 6. Comparison chart

Finally, GraphFrames continue to get faster, and a Spark Summit talk by Ankur Dave shows specific optimizations. A newer version of the GraphFrames package, compatible with Spark 2.0, is available as a Spark package.

Step 6: Structured Streaming with Infinite DataFrames

 
For much of Spark’s short history, Spark Streaming has continued to evolve to simplify writing streaming applications. Today, however, developers need more than just a streaming programming model to transform elements in a stream. They need a model that supports end-to-end applications that continuously react to data in real time; we call these continuous applications.

Continuous applications have many facets: interacting with both batch and real-time data, performing ETL, serving data to a dashboard from batch and stream, or doing online machine learning by combining a static dataset with real-time data. Currently, such facets are handled by separate applications rather than a single one.

Apache Spark 2.0 lays foundational steps for a new higher-level API, Structured Streaming, for building continuous applications.


Fig 7. Traditional Streaming vs Structured Streaming

Central to Structured Streaming is the notion that you treat a stream of data as an unbounded table. As new data arrives from the stream, new rows are appended to the unbounded table:


Fig 8. Stream as an unbounded table

You can perform computations or issue SQL-type queries on your unbounded table just as you would on a static table. In this scenario, developers can express their streaming computations just like batch computations, and Spark will automatically execute them incrementally as data arrives in the stream.


Fig 9. Similar code for streaming and batch

A cool benefit of the Structured Streaming API, which is built on the DataFrames/Datasets API, is that your DataFrame or SQL-based query for a batch DataFrame is nearly identical to the streaming one, as you can see in the code in Fig 9, with only a minor change: in the batch version, we read a static, bounded log file, whereas in the streaming version, we read off an unbounded stream. Though the code looks deceptively simple, all the complexity is hidden from the developer and handled by the underlying model and execution engine, which is explained in the video talk.
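As a hedged sketch of that minor change (the paths and the grouping column are illustrative), the batch query reads a bounded file with read, while the streaming query reads an unbounded source with readStream and writes its results continuously:

// batch version: read a static, bounded set of JSON logs
val staticLogs = spark.read.json("/data/logs")
val batchCounts = staticLogs.groupBy("level").count()

// streaming version: the same query over an unbounded stream of JSON logs
val streamingLogs = spark.readStream.schema(staticLogs.schema).json("/data/logs")
val streamCounts = streamingLogs.groupBy("level").count()
// continuously write the running counts to the console
val query = streamCounts.writeStream.outputMode("complete").format("console").start()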

After you take a deep dive into Structured Streaming in the video talk, also read the Structured Streaming Programming Model, which elaborates on all the under-the-hood complexity of data integrity, fault tolerance, exactly-once semantics, window-based aggregation, and out-of-order data. As a developer or user, you need not worry about these details.
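For example, a window-based aggregation over event time can be expressed as a short, hedged sketch like this (the schema, path, and column names are illustrative):

import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// hypothetical schema and path for a stream of log events
val eventSchema = new StructType().add("timestamp", TimestampType).add("level", StringType)
val events = spark.readStream.schema(eventSchema).json("/data/events")
// count events per level over sliding 10-minute windows of event time
val windowedCounts = events.groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("level")).count()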

Learn more about Structured Streaming directly from Spark committer Tathagata Das, and try the accompanying notebook to get some hands-on experience with your first Structured Streaming continuous application.

Structured Streaming API in Apache Spark 2.0: A new high-level API for streaming

Similarly, the Structured Streaming Programming Guide offers short examples on how to use supported sinks and sources:

Structured Streaming Programming Guide

Step 7: Machine Learning for Humans

 
At a human level, machine learning is all about applying statistical learning techniques and algorithms to a large data set to identify patterns, and from these patterns make probabilistic predictions. A simplified view of a model is a mathematical function f(x): given a large data set as input, the function f(x) is repeatedly applied to the data to produce an output, a prediction.


Fig 10. Model as a mathematical function
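As a toy illustration (not from the article), a simple linear model is exactly such a function: given learned parameters w and b, f(x) = w * x + b maps an input to a prediction.

// a toy linear model f(x) = w * x + b with illustrative learned parameters
val (w, b) = (0.75, 2.0)
val f: Double => Double = x => w * x + b
// applying the model to an input produces a prediction: 0.75 * 10.0 + 2.0 = 9.5
val prediction = f(10.0)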

For key terms of machine learning, Matthew Mayo’s Machine Learning Key Terms, Explained is a valuable reference for understanding some concepts discussed in the webinar link below.

Machine Learning Pipelines

Apache Spark’s DataFrame-based MLlib provides a set of algorithms as models and utilities, allowing data scientists to build machine learning pipelines easily. Borrowed from the scikit-learn project, MLlib pipelines allow developers to combine multiple algorithms into a single pipeline or workflow. Typically, running machine learning algorithms involves a sequence of tasks, including pre-processing, feature extraction, model fitting, and validation stages. In Spark 2.0 such a pipeline can be persisted and reloaded later, across the languages Spark supports (see the blog link below).


Fig 11. Machine Learning pipeline
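As a hedged sketch of such a pipeline (the column names, parameters, save path, and the trainingDF DataFrame are illustrative, not from the article), a text classification workflow might chain a tokenizer, a feature extractor, and a logistic regression model, then persist the fitted pipeline:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// stages: pre-processing, feature extraction, and model fitting
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit the whole pipeline on a hypothetical training DataFrame with "text" and "label" columns
val model = pipeline.fit(trainingDF)

// persist and reload the fitted pipeline (new in Spark 2.0), usable across Spark's languages
model.write.overwrite().save("/models/text-classifier")
val sameModel = PipelineModel.load("/models/text-classifier")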

In the webinar on Apache Spark MLlib, you will get a quick primer on machine learning and Spark MLlib, plus an overview of some Spark machine learning use cases, along with how other common data science tools such as Python, pandas, scikit-learn, and R integrate with MLlib.


Moreover, two accompanying notebooks for some hands-on experience, along with a blog on persisting machine learning models, will give you insight into why, what, and how machine learning plays a crucial role in advanced analytics.

  1. 2015 Median Home Price by State
  2. Population vs. Median Home Prices: Linear Regression with Single Variable
  3. Saving and Loading Machine Learning Models in Apache Spark 2.0

If you follow these steps, watch all the videos, read the blogs, and try out the accompanying notebooks, we believe that you will be well on your way to mastering Spark 2.0.

Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Sameer Farooqui is a Technology Evangelist at Databricks where he helps developers use Apache Spark by hosting webinars, writing blogs and speaking at conferences and meetups.
